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Abstract 

We present an algorithm for the stochastic simulation of gene expression and heterogeneous 
population dynamics. The algorithm combines an exact method to simulate molecular-level fluc- 
tuations in single cells and a constant-number Monte Carlo method to simulate time-dependent 
statistical characteristics of growing cell populations. To benchmark performance, we compare 
simulation results with steady-state and time-dependent analytical solutions for several scenar- 
ios, including steady-state and time-dependent gene expression, and the effects on population 
heterogeneity of cell growth, division, and DNA replication. This comparison demonstrates that 
the algorithm provides an efficient and accurate approach to simulate how complex biological 
features influence gene expression. We also use the algorithm to model gene expression dynam- 
ics within 'bet-hedging' cell populations during their adaption to environmental stress. These 
simulations indicate that the algorithm provides a framework suitable for simulating and analyz- 
ing realistic models of heterogeneous population dynamics combining molecular-level stochastic 
reaction kinetics, relevant physiological details and phenotypic variability. 
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1. Introduction 

Stochastic mechanisms play key roles in biological systems since the underlying biochemi- 
cal reactions are subject to molecular-level fluctuations (see e.g. lflTll28l ). Chemical reactions 
are discrete events occurring between randomly moving molecules. Consequently, the timing of 
individual reactions is nondeterministic and the evolution of the number of molecules is inher- 
ently noisy. One example of particular importance is the stochastic expression of gene products 
(mRNA and protein) ifTTI [T2l l20l l23l l28l . Here, molecular-level fluctuations may cause ge- 
neticaUy identical cells in the same environment to display significant variation in phenotypes. 
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loosely defined as any observable biochemical or physical attribute. While such noise is generally 
viewed as detrimental due to reduced precision of signal transduction and coordination, several 
scenarios exist where noise in gene expression may provide a fitness advantage (see Fraser and 
Ksern |6l for a review). For example, it has been proposed that a cell population may enhance its 
ability to reproduce (fitness) by allowing stochastic transitions between phenotypes to increase 
the likelihood that some cells are better positioned to endure unexpected environmental fluctua- 
tions ir|. 

Due to the importance of noise in many biological systems, models involving stochastic 
formulations of chemical kinetics are increasingly being used to simulate and analyze cellular 
control systems (9). In many cases, obtaining analytical solutions for these models are not feasi- 
ble due to the intractability of the corresponding system of nonlinear equations. Thus, a Monte 
Carlo (MC) simulation procedure for numerically calculating the time evolution of a spatially 
homogeneous mixture of molecules is commonly employed fT.^. Among these procedures, the 
Gillespie stochastic simulation algorithm (SSA) is the de-facto standard for simulating biochem- 
ical systems in situations where a deterministic formulation may be inadequate Q- The SSA 
tracks the molecular number of each species in the system as opposed to the variation in con- 
centrations in the deterministic framework. Hence, high network complexity, large separation 
of time-scales and high molecule numbers can result in computationally intensive executions. 
Another challenge is the need for simulating cell populations. In many cases, gene expression 
is measured for 10-100 thousand individuals sampled from an exponentially growing culture of 
continuously dividing cells. While the dynamics of individual cells can be appropriately sim- 
ulated by disregarding daughter cells, repeating such simulations for a fixed number of cells 
may not capture population variability arising from asymmetric division, for example, or age- 
dependent eff'ects. The alternative, tracking and simulating all cells within the population, is 
intractable beyond a few divisions due to an exponential increase in CPU demands as a function 
of time Il22l . 

Here, we present a flexible algorithm to enable simulations of heterogeneous cell popula- 
tion dynamics at single-cell resolution. Deterministic and Langevin approaches to account for 
changes in intracellular content and the constant-number MC method 1 18 , ,31J were previously 
been combined to simulate and analyze gene expression across cell populations 11211 l22l . In these 
studies, extrinsic heterogeneity associated with stochastic division and partitioning mechanisms, 
and intrinsic heterogeneity associated with molecular reaction kinetics were considered. Our al- 
gorithm, which combines the exact SSA for single-cell molecular- level modeling and a constant- 
number MC method for population-level modeling, is designed to incorporate user-defined bi- 
ologically relevant features, such as gene duplication and cell division, as well as single cell, 
lineage and population dynamics at specified sampling intervals. Additionally, the SSA, which 
can be replaced by approximate methods if desired, is implemented within a shared-memory 
CPU parallelization framework to reduce simulation run-times. The emphasis of our study is 
to validate the accuracy of the method by directly comparing simulated results to the analytical 
solutions of models describing increasingly realistic biological features. Our results indicate that 
combining the SSA and the constant-number MC provides an efficient and accurate approach 
to simulate heterogeneous population dynamics, and a reliable tool for the study of population- 
based models of gene expression incorporating physiological detail and phenotypic variability. 

This paper is organized as follows: Sections 2 and 3 briefly introduce the SSA and the 
constant-number MC method, respectively. The developed algorithm is described in Section 4. 
Section 5 provides the results of the benchmarking against analytical results. Finally, in Section 
6, we demonstrate the applicability of the algorithm to more complex contexts by demonstrat- 
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ing that it can quantitatively reproduce experimental measurements of gene expression dynamics 
within 'bet- hedging' cell populations during their adaption to environmental stress. The work is 
summarized in Section 7. 



2. Stochastic Simulation Algorithm 

The physical basis of the stochastic formulation of chemical kinetics is a consequence of 
the fact that collisions in a system of molecules in thermal equilibrium is essentially a random 
process |i8i|. This stochasticity is correctly accounted for by the Gillespie SSA, a MC procedure 
to numerically simulate the time evolution of chemical and biochemical reaction systems. While 
based on an assumption of intracellular homogeneity and mass-action kinetics, it is the de-facto 
standard for simulations of gene expression. In the Direct Method Gillespie SSA, M chemi- 
cal reactions R\, . . . ,Rm with rate constants ci, ...,cm among chemical species X[, ...,Xn, are 
simulated one reaction event at a time. The next reaction to occur (index fi) and its timing (t) 
are determined by calculating M reaction propensities ai,...,aM, given the current number of 
molecules of each of the chemical species, to obtain an appropriately weighted probability for 
each reaction. It can be implemented via the following pseudocode QIHl: 

1: if f < tend and om = H^Hi fli. 5^ then 
2: for ; = 1 , M do 
3: Calculate a, and a; - Yj'v=\ '^v 
4: end for 

5: Generate uniformly distributed random numbers {r\,r2) 

6: Determine when (t - ln(l/ri)/aM) and which (min{ /i | > raffM)) reaction will occur 

7: Set f = f -H T 

8: Update {Xi] 
9: end if 

The SSA can be augmented to incorporate biologically relevant features, such as changes 
in the volume of the cell during growth, the partitioning of cell volume and content at division 
and DNA replication (see e.g. |I2][19]|25]). Changes in cell volume may have significant effects 
on reaction kinetics. First order reactions have deterministic rate constants {wm) and stochastic 
rate constants {cm) that are equal and independent of volume |14]. However, for higher order 
reactions, it is necessary to incorporate cell volume Vit) into the reaction propensities in order to 
perform an exact simulation. For example, the stochastic rate constant for a bimolecular second 
order reaction at time t is given by 



where Na is Avogadro's number Therefore, in the SSA, the rates of higher-order reactions 
must be scaled appropriately by the current cell volume before calculating propensities. This 
procedure has previously been demonstrated to provide a satisfying approximation as long as the 
kinetic time-scale is short compared with the cellular growth rate |fT9l . Typically, the volume of 
each cell k is modeled using an exponential growth law 
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(2) 
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where Vq is the cell volume at the time of its birth, f^,,, is the time and tq is the interval between 
volume doublings. This functional form allows for the description of dilution as a first- order 
decay process within a deterministic model of intracellular concentrations. 

Once the SSA incorporates a continuously increasing cell volume, it is necessary also to 
specify rules that govern cell division. One option is 'sloppy cell-size control' 134] where the cell 
division is treated as a discrete random event that take place with a volume-dependent probability. 
Another simpler option is to assume that division occurs once the cell has exceeded a critical size 
Vdiv corresponding to one doubling of its initial volume, Vdiv = 2yo. The volume doubling time 
To then becomes cell division time and tdiv becomes the time since the last division. When cell 
division is triggered, i.e. when Vkitdh ) > Vdiv, additional rules must be specified to model the 
partitioning of cellular content between mother and daughter cells. For example, asymmetric 
cell division can be modeled by setting Vdaughter < ^mor/icr- The molecules of the cell can then be 
partitioned probabilistically between the two volumes 1141 l27l [30l [33l . 

The SSA can accommodate additional discrete events. For example, the G2/M cell cycle 
checkpoint which ensure proper duplication of the cell's chromosomes before division, can be 
modeled by defining a variable representing the completion of DNA replication such that cell 
division is delayed until the DNA content of the cell has doubled. The replication of individual 
genes, which doubles the maximum rate of gene transcription by doubling the number of cor- 
responding DNA templates, can be modeled as a discrete event that occurs at a fixed time trep 
after cell division, i.e. when f^,,, > t^ep, or as a random event that occurs with some variable 
probability. In both cases, the DNA-replication event can be placed in a cell-specific stack of fu- 
ture events that is compared against tdiv (or t in the above pseudocode) following each SSA step. 
Events in the stack scheduled to occur before this time are then executed and removed from the 
stack. This can be incorporate into the above pseudocode by inserting the following two lines: 

8a: if lengthitevem) ^ then (there are scheduled events) 

8b: if f > teventiO tlicii cxccutc event(i) and delete teventif) from stack 

This approach also provides a convenient basis for simulating the effects of time-delays Il25ll26l . 

We note that the exact SSA can be extremely computationally intensive since the step size 
T becomes very small when the total number of molecules is high or the fastest reaction occurs 
on a time-scale that is much shorter than the time-scale of interest. It therefore useful to develop 
techniques that can be used to speed up the simulation. This can be done, for example, using 
approximate methods such as the tau-leaping procedure in which each time step t advances the 
system through possibly many reaction events |10|. Additionally, since many independent runs 
are required to compute population statistics, parallel computing can be used to further optimize 
simulation run-times. 

3. Constant-Number Monte Carlo 

Implementations of the modified SSA that track only one of the two cells formed during cell 
division may introduce artifacts in the calculation of population characteristics in the presence 
of significant phenotypic variability among cells. For example, gene expression capacity and 
division time may depend on chronological age; old cells may express genes at a reduced rate, 
and daughter cells may need to mature before they can reproduce. In addition, reproductive rates 
may be influenced by the accumulation of genetic mutations within a specific cell lineage or by 
the current levels of gene expression within individual ceUs. To simulate stochastic models of 
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gene expression incorporating such features, it is necessary to couple the SSA with simulation 
techniques used in studies of population dynamics. 

The population balance equation (PBE) is a mathematical statement of continuity that ac- 
counts for all the processes that generate and remove particles from a system of interest (2^, 
including individual members of a population 131]. In a general molecular-dynamics framework, 
the PBE contains terms due to nucleation, coagulation and fragmentation, and so forth, and is 
mathematically represented by an integro-differential equation that typically must be solved nu- 
merically to obtain particle size distribution and densities as a function of time [31J. Due to 
the integro-differential nature of the problem, discretization of the size distribution is required. 
This is problematic because features of the distribution are not known ahead of time and may 
change during growth [15, 311 . To resolve discretization problems that hinder the direct inte- 
gration of the PBE, one can use MC methods to sample a finite subset of a system in order to 
infer its properties and study finite-size effects, spatial correlations, and local fluctuations not 
captured by a mean field approximation iT0l[T8ll24ll3TI . Furthermore, a MC method is appropri- 
ate as its discrete nature adapts itself naturally to growth processes involving discrete events, and 
can simulate growth over arbitrary long times with finite numbers of simulation particles while 
maintaining constant statistical accuracy 1J_8J . 

In order to construct a reliable and efficient algorithm to simulate biological cell populations, 
a constant-number MC method is adopted to simulate the birth-death processes that take place 
within such populations [ 18^ 21^ 22, JJJ. This approach permits modeling of growing populations 
using a fixed number of cells while avoiding the alternative (i.e. an infinitely growing popula- 
tion) by sampling particles representing the population as a whole. It essentially amounts to 
contracting the physical volume represented by the simulation to continuously maintain a con- 
stant number of cells IfTSl . The constant-number MC approach has been successfully applied to 
a variety of non-biological particulate processes llT6l[T8l[3TI as well as cell population dynamics 

ED 123. 

In our implementation of the constant-number MC, we keep track of individual mother and 
daughter cells in two separate arrays. Each time a cell divides, the daughter cell is placed in the 
daughter array and the time of birth recorded. Then, at specified intervals, cells within the mother 
array are replaced one at a time, with the oldest daughter cells being inserted first. Because every 
mother cell is equally likely to be replaced during the sample update, the size distribution of the 
population remains intact for sufficiently large populations [311 . In our case, the size distribution 
corresponds to the distribution of cell ages (or volumes) across the population. 

The constant-number MC method can be represented by the following pseudocode: 

1: if f > tresrore and NC daughter > 1 then 
2: for all NCdaughler dO 

3: Randomly select mother cell 

4: Replace mother cell with oldest available daughter cell 
5: end for 
6: end if 

Here, ti-estore is the interval between population updates and NCdaughter the number of daughter 
cells born since the last update. To avoid simulating the daughters of daughter cells, trestore 
is chosen such that mother cells divide at most once, and daughter cells not at all, during a 
particular trestore interval. 
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4. Algorithm 

Simulations are carried out using an initial population distribution, where gene expression in 
each cell is described by a user defined set of equations, and population statistics are obtained 
at a specified sampling interval. Here, stochastic simulation is earned out using the Gillespie 
direct method |17][8l, however any stochastic simulation method can be implemented. Parallelism 
is implemented across the simulation (see Fig. [T] and pseudocode in this section), as a large 
number of independent simulations need to be performed when simulating the dynamics of a 
cell population, in a shared memory multiprocessor environment. 

The algorithm can be expressed by the flow diagram (Fig. [T} and the following pseudocode: 



while t < te„j do 

begin parallel region 

for all NC population such that t < tsampk do 
Gillespie SSA (see pseudocode in Section 2) 
Update Vk 

Execute events in stack with tgyg„, < f^,v 
if Vk(tdn) > Vdiv then 
Execute cell division 
Increment NC daughter 
end if 
end for 

Update trample 

end parallel region 

Execute constant-number MC (see pseudocode in Section 3) 
Compute statistics 
end while 

Here, NC population is the total number of cells in the population, Vk the volume of cell k, and 
t sample the uscr defined population sampling interval. 

The algorithm can execute simulations of considerable size in reasonable times. For example, 
an IBM with 2 quad-core processors (1.86GHz core s) and 2.0GB of RAM completed a IQ^ s 
simulation of the network presented in Section 



5.1 



for 8000 cells in 81i when vn = Q3s-\ 



vi = 0.05,r', t/o = 0.05^-', c/i = 5 x lO^^i"', fj,,, = 3600^, and Restore = 3300i. 



5. Numerical Results 

In order to evaluate the accuracy of the present algorithm, we compare simulation results 
to steady-state and time-dependent analytical solutions of constitutive gene expression models. 
In this section, models describing increasingly realistic biological features are considered and 
presented along with the derivations of the corresponding analytical solutions. We have in- 
cluded these details to emphasize the significant complexity associated with the derivation of 
even simple kinetic models. Part of our motivation for developing the algorithm is the antic- 
ipation that finding analytical solutions to models incorporating complex biochemical reaction 
network and cellular physiology will be intractable. We begin in Subsection 5.1 by consid- 
ering time-dependent gene expression, i.e., the transcription of RNA and translation of RNA 
into protein, and benchmark this scenario against the corresponding time-dependent analytical 
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distributions. In Subsection 5.2 we consider both time-dependent and time-independent gene ex- 
pression using a model that incorporates the effects of gene dupHcation and cell division on gene 
expression dynamics in individual cells using the constant-number MC method. All simulations 
statistics were obtained from populations consisting of 8000 cells. 

5.1. Time-Dependent Population Distributions 

Population-based simulation algorithms have the advantage of yielding time-dependent population- 
distributions as the output. To evaluate the accuracy of our approach in this respect, validation 
against a time-dependent distribution is of interest. For this purpose, we simulate a two-stage 
gene expression model consisting of the following biochemical reactions: 



T — » r -H niRNA 

do 

mRNA — > 

mRNA mRNA + P 

d\ 



(3) 
(4) 
(5) 
(6) 



where Eq. ([3]l describes transcription at a rate vq, Eq. (j4]) the degradation of the mRNA at a rate 
da, Eq. (jSjl translation at a rate vi , and Eq. (j6]l the protein degradation at a rate d\ . Here, all rates 
are given in probability per unit time and it is assumed that the promoter T is always active and 
thus the model has two stochastic variables, the number of mRNAs and the number of proteins 
P. 

Shahrezaei and Swain |30| studied the system described by Eqs. ([3]l-(|6]) and derived an ap- 
proximative protein distribution as a function of time. The approximation is based on the as- 
sumption that the degradation of mRNA is fast compared to the degradation of proteins (i.e. 
doldi » 1). Consequently, the dynamics of mRNA is at the steady-state for the most of a pro- 
tein's hfetime. The essential steps of the derivation are as follows (see supplementary materials 
in ||30ll for the complete derivation): 

The chemical master equation (CME) describing the probability of having m mRNAs and n 
proteins for the system in Eqs. ( 3]|6 1 at time t is given by 



BP 

— r-^ = Vo{Pm-l,n - Pm,n) + V\m{Pm,n-\ - Pm,n) 

at 

+ do[(m + l)P,„+i,„ - mP„,_„] 
+ di[(n + l)P,„,„+i - «P„,,„]. 

If we let M = z - 1 and v = z - 1, the corresponding generating function F{z , z), defined in Il30l 

as llm,n(z)'"z"Pm,r„ IS givcn by 



1 OF OF 
V dr dv 



-J 



h(\ +u)- 



dF u 
— - a-F, 
ou V 



(7) 



where a - vo/di, b - vi/do, y - doldx, and t - d\t. If r measures the distance along a 
characteristic, which starts at r = with u - uq and v - vq for some constants mq and vq, then 
from Eq. |7]it is found that 

(lu r u 

-y b{\ +u)-- (8) 



du 
d? 



b(l + u) ■ 
1 



using the method of characteristics. Consequently direct integration implies that v - r and Eq.[8] 
has the solution 



m(v) — e 



C 



-by I 



dv 



v'y 



(9) 



for a constant C. By Taylor expansion of e^''^' such that e''^'' - Ymiybv)" I n\ the integral in Eq. 
|9]can be evaluated, and, if Stirling's approximation is subsequently applied, u{y) is found for 
7 » 1 to obey 



U{v) = Mo - 



or 



1 - bv() 
u(v) s 



-yh(v-vo) 



bv 
l-bv 



bv 



(10) 



(11) 



l-bv 

as V = v^e^ > vq for t > 0. When y >> 1, m tends rapidly to a fixed function of v and the 
generating function describing the distribution of proteins can be obtained from Eq. |7] 



dF 
dv 



ab 
1-bv^ 



(12) 



Integrating Eq. [T2]yields the probability distribution for protein number as a function of time 

l-b(z-l)e-^" 



I + b - bz 

By definition of a generating function, expanding F(z) in z yields 



Pn(T) = 



r(fl + n) 

r(n + i)r(fl) 



1 +b 



I + be 



1 +b 



—n, —a, I — a — n; 



1 +b 



+b 



(13) 



(14) 



where 2F\ and F are the hypergeometric and the gamma function, respectively. The initial num- 
ber of proteins n is set to zero. In this case, the mean, variance, and protein noise of the process 
are described respectively by 

^ip{T) = ab{\ - O, (15) 
(^l{T)^^lp{l+b + be-'), 



1 + b + be ^ 
ab(l - e-0 



1/2 



(16) 
(17) 



To benchmark the ability of the algorithm to accurately generate time-dependent population 
distributions, we simulated Eqs. ([3])-(|6]l under conditions where the assumptions of Eq. ([T4| are 
satisfied, and compared the resulting distributions with corresponding time-dependent analytical 
distributions. Fig. |2] shows the simulated and analytical distributions at two different values of 
dimensionless time t. The population statistics, specifically fip and rjp, as a function of t are 
shown in Fig. [3] In both cases, the simulated protein distributions and statistics are in excellent 
agreement with the analytical results (Eqs. ( 14 1-( 17 1). 
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5.2. Gene Duplication, Cell Division, and Time-Dependent Validation 

To explore the accuracy of the algorithm when simulating models incorporating cell growth, 
division, and DNA replication, we implemented the simplified reaction network presented in 
Swain et al. [33 1. The reduced reaction network was obtained from a model of gene expression 
consisting of 8 molecular species and 1 1 chemical reactions. For this simplified network, it is 
possible to derive time-dependent analytical results for the mean protein number and coefficient 
of variation in protein number Importantly, by making the appropriate approximations, the 
effects of gene replication and cell division can be included in the analytical solutions. The 
reduced model have two components - one described by the reactions in Eqs. (|3]l-(j6]l (note that 
the reaction rates vi and t/o can be directly related to Vj and d^ in the original model |33|), and 
another describing pre-transcription kinetics. This component captures the reversible binding of 
RNAP to the promoter (rate constants bo and /o), and the formation of an open promoter complex 
(rate constant ko). These steps are described by the reactions 

D ^ C (18) 

bo 

C D + T (19) 

where D, C and T represent the promoter with polymerase unbound, the promoter with poly- 
merase bound and the open promoter complex, respectively. Since the total number n of DNA 
molecules is conserved before and after replication, D and C can be constrained by 

«() + «! - n, (20) 

where no and ni are the number of promoter copies in state D and C respectively. 

To derive an analytical solution, the authors invoked the assumption that the distributions of 
C, T, and mRNA can be approximated by their steady state distributions. While this assumption 
thus ignores the transient dynamics of these species, it is expected to introduce a minimal error 
since the protein degradation rate di is much smaller compared to the other reaction rates. As 
a consequence, the mean and coefficient of variation protein P are time-dependent while the 
moments of the distributions of the other species are constant. Even with this approximation, the 
derivation of the analytical solutions for the mean and coeflicient of variation is rather arduous. 
In the following, we highlight the only the main points (the complete derivation can be found 
in the supplementary material of Swain et al. Il33l ). It consists of three separate stages - the 
derivation of time-dependent expression for the population mean and noise, the incorporation of 
gene replication and the addition of cell division. 

The first stage is analogous to the derivation of time-dependent moments in Section [5T| that 
is, cell cycle effects are neglected and the probability distributions for the species C,T,mRNA, 
and P is described using the CME. In this case, the variables n[,n2,n3, and n4 are used to describe 
the numbers of C,T,mRNA, and P, respectively, and /?(«!, n2, "3, "4,0 denotes the probability 
density function of the time-dependent state. The CME can be correspondingly be written in the 
form 

dpini,n2,n3,n4,t) 

= fo[(n- ni + l)p(ni - I,n2,n3,n4,t) 

ot 

— (« — ni)p{ni,n2,ni,n4,t)] + ■■ ■ , (21) 

where dots denote similar terms, one for each rate constant. The CME is then used to derive an 
expression for the time-dependent probability-generating function. The probability-generating 
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function is defined by 



E 



Ml n? m riA f 4.\ 



(22) 



It can easily be seen that differentiating F with respect to z, and setting all z, to unity, gives yU„ 
and similarly the second derivative gives yU„ („,_i). Applying the transformation given by Eq. (p2| 



to the CME (Eq. (21 1), an expression for the probability-generating function can be obtained. 
This expression has the form of the partial differential equation 

dF dF dF 

— = fmwF -[fawil +w) + bQW -kQ{x-w)\ — + VQ{y - x) — 
at ow ox 



■ [v\z(l +y)- t/oj] ^ - diz 



dF 



OF 



(23) 



where w = zi - l,x = Z2 - !,>' = 23 -1, and z = Z4 - 1. This equation, just like the CME, 
is practically impossible to solve. However, the equation can be combined with a second order 
Taylor expansion of Eq. ( 22 1 which can be written in the form 

Fiw, X, y,z,t) - 1 + wXi + XX2 + yXj + 2X4(0 + ^ 1 + X22X^ 

+X33y^ + ^44(02' + 2X1 2WX + 2Xiiwy + 2X22xy 

+2Xu(t)wz + 2X24(0 + 2X34(0yzl, (24) 

where the expansion is taken around w = 0, x = 0, y = 0, z = so that the following holds: 
Xj - fi„., Xii - ju„2 -//„,, and Xij = fi„.„ , i + j. Here it is important to note that only the processes 
involving protein molecules are time-dependent according to the previous assumptions. The Eq. 
(24 1 is then substituted to Eq. (23 1, the coefficients are compared and solvable expressions for the 
expected values, variances, and covariances of the considered process are obtained. This gives 
equations governing the variables X4 - and X44 - fip(p-i) 



dX^it) 
dt 

dX44(t) 

dt 



v\X3 - d]X4(t), 

2v'iX34(0 - 2diX44(t). 



(25) 
(26) 



Assuming that fJ.p{Q) - m, Eqs. (25 1 and (26 1 can be solved using expressions for the other 
Xjj variables. The expressions are rather complex and the interested reader should refer to 1331 . 
Solving Eqs. (25 1 and (26 1 yields the following expressions for the protein mean and variance 

V1X3 



di 



where 



and 



v',/o^on 
df^d\ I 



dx 



d'o- 



■ d\ 



'?33 + 



di + V() 
10 



^23 + 



di + 1 



Tin 



(27) 
(28) 

(29) 
(30) 



Note that Q is a measure of the mRNA fluctuations, I - fo + bo + ko, and that rj^j is given by 

'7,7 = ■ (31) 

The effects of gene repHcation are incorporated in the second stage of the derivation. The 
number of proteins at the beginning of each cell cycle is determined by the time evolution of 
the system during the cycle of a parent cell. To assess the time evolution of protein molecules 
during the cell cycle, the probability q„\m(t) of having n proteins at time t, given that there were 
m proteins at time f = is defined and the probability-generating function Qmiz, t) for this 
distribution is constructed. By definition, the generating function has the form 

Q,n(z,t)^J]qn\m{t)z". (32) 

n 

The equation can be expanded around z = 1 which yields 

Q,„(z, t)=l+(z- 1)^P + ^(z - ifW^ -^ip] + --- (33) 
This function can be determined up to the necessary level by means of equations fip(t) and crj,(t). 



Using Eq. 33 it is obtained that 

Q,n(z, t) = eo(z, t) [l - e-'^' + ze-'"Y . (34) 

Because the gene replication occurs at time t - tj, two different forms of Qmiz, t) have to be 
considered; Qm\z, t) which is valid when the gene number is n, and Q^miz, t) which is vahd 
when the gene number is 2n. Thus 

Q^2iz, t) = e®(z, t)[Y + z{\- Y)r , (35) 

where 7=1- e Now it is possible to proceed to the third stage of the derivation where cell 
division is included. 

The third stage incorporates cell division. Cell division is in the model assumed to occur at 
fixed intervals given by the division time T^. When t - it is assumed that each protein has a 
50 % probability of being kept in this cell (symmetric division) and the probability of having n 
proteins immediately after the division is the binomial 



2-"' (36) 



given that there are m proteins just before cell division. The transfer probability from one cell 
cycle to another can be constructed by combining the binomial distribution with the protein dis- 



tribution derived earlier (Eq. 24 1. After many divisions, the protein number tends to a limit cycle 
and expressions for the mRNA and protein mean and coefficient of variation can be obtained in 
the limit di/d^ « 1. Through a fairly complicated set of steps, it can be shown f33l that the 
mean mRNA number before gene duplication (t < td), and the mRNA coefficient of variation are 
given by 

2 _ ^o^'»(<) + ^ + ^o) 

t^mRNA n(df^ + l){l + Vo)(«o + ^O) 
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The mean protein number and coefficient of variation in protein number as functions of time can 
be derived as 



rilit) 



di 



1 



1 



1 - 



foko 



where 



and 



01 (f) = 



1 - 



forQ<t<td 



2 + e 



-diT 



4_g-2<i] T _2c-2''l '-e-^"*! C+'-'d) 
(2-c-<'i''~e-''i<''+'-'rf>)^ ' 



for 0<t<td 
for td<t<T. 



(39) 
(40) 

(41) 

(42) 



In Eqs. (41 



i 2(2-£.-''i''-e-''i<'-V))- 

and (42 1, and T denote the gene repHcation time and the cell division time. 



respectively. 



It is noted that Eqs. ( 37 1 and ( 38 1 are time independent and that the value of the mean is twice 



this result after gene replication occurs (i.e. when t > td). The time independence follows from 
the assumption that the RNA is in a quasi-steady state proportional to the gene copy number n, 
and that all other time dependencies are absorbed into the protein distribution. 

Our simulation results are compared to the corresponding steady-state and time-dependent 
analytical solutions (Figs. |4]-|6]l. In these simulations, we use the same assumptions as in Il33ll : the 
cell volume increases linearly up to time of cell division T, gene replication occurs at trep - OAT 
and cell division is symmetric with binomial partitioning of molecules. Simulated protein num- 
ber and concentration, as well as mRNA number dynamics, for single cells (Fig. |4]i are compa- 
rable with the simulation results obtained by Swain et al. [331 . Figures |5] and |6] further compare 
population characteristics estimated from simulations to those predicted by the corresponding 
steady-state analytical solutions. Both RNA (jimRNAin) and ?7^,^;v^(n), Fig. [sjl and protein (jip{t) 
and rj^p, Fig.|6]) characteristics are in good agreement with the analytical results (Eqs. (37 1-(42 1). 



6. Simulating complex population dynamics 

6.1. Asymmetric Cell Division 

To investigate sources of external variability in eukaryotic gene expression, Volfson et al. Il35l 
combined computational modelling with fluorescence data. As part of this study, the authors 
simulated the distribution of cell sizes within a population of Saccharomyces cerevisiae (bud- 
ding yeast). In these simulations, cells grew exponentially until they reached a critical volume 
Vc where they divide. The volume at division was drawn from a normal distribution with a 
mean specified as a function of genealogical age and coefficient of variation 0.15. Following 
division, the mother cell retained 70 % of the volume (Vq - 0.7yc) while daughter cells were 
correspondingly smaller (Vq - 0.3yc). The resulting distribution of cell sizes obtained from an 
initial population of 1000 cells allowed to grow to 100000 cells was found to be in agreement 
with experimental and analytical results ||35]| . 
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The model by Volfson et al. fi5\ is ideally suited for benchmarking the constant-number MC 
method. As in Volfson et al. [35J . we first simulated the growth of a population initially con- 
sisting of 1000 cells and obtained the steady-state size distribution once the population grew to 
100000 cells (Fig.|7^). Next, we repeated the simulations using the constant-number MC method 
to estimate the size distribution from a representative sample (8000 cells) of this cell population 
(Fig. [T]^). a plot of the probabilities for the sample population against the probabilities of the 
'true' population shows that the difference between these variables is minimal (Fig. |7};). These 
results compliment previous studies lfT6l [181 1211 12211311 demonstrating the ability of the constant- 
number MC method to capture complex population dynamics. 

6.2. Bet-Hedging Cell Populations 

One of the most interesting potential applications of the simulation algorithm described in 
Section[4]is investigations of interactions between environmental changes, population dynamics 
and gene expression in individual cells. For example, it can be used to study the optimization 
of fitness in fluctuating environments, which is a classic problem in evolutionary and population 
biology ||4j |T7| ED [32l . Acar et al. [1| experimentally investigated how stochastic switching 
between phenotypes in changing environments affected growth rates in fast and slow-switching 
populations by using the galactose utilization network in Saccharomyces cerevisiae. Specifi- 
cally, a strain was engineered to randomly transition between two phenotypes {ON and OFF) 
characterized by high or low expression of a gene encoding the Ura3 enzyme necessary for uracil 
biosynthesis. Each phenotype was designed to have a growth advantage over the other in one of 
two environments. In the first environment {E\) which lacks uracil, cells in the ON phenotype 
have an advantage. In the second environment (£2), cells in the OFF phenotype have an advan- 
tage due to the presence of a drug (5-FOA) which is converted into a toxin by the Ura3 enzyme. 
In this environment, which also contains uracil, cells expressing Ura3 will have low viability 
while cells not expression Ura3 will grow normally. 

Models of gene expression often describe the promoter T as being in one of two states: a 
repressed state Tr (basal level of gene expression) or an active state Ta (upregulated level of 
gene expression) corresponding respectively to OFF and ON phenotypes. This can be described 
by the following biochemical reaction scheme 1. 1 1 .1 : 



Ta ^ Tr, (43) 
ki 

Ta^Ta + mRNA (44) 

Tr^Tr + mRNA (45) 

mRNA (46) 

mRNA A mRNA + P (47) 

P-^0 (48) 



where Eq. (43 1 describes the transitions to the Ta and Tr promoter states at rates ^1 and k2 



respectively, Eqs. (44i and (45 1 the mRNA production from the Ta (at a rate vq^) and Tr (at 



a rate vqr) states respectively, Eq. (47i the protein production from mRNA at a rate vi, and 
Eqs. (46 1 and ( |48] l respectively the mRNA (at a rate c/q) and protein (at a rate di) degradation. 
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We first follow the approach that was used in Acar et al. f\\ to describe the dynamics of 
phenotype switching, where cells are in either the ON or the OFF state: 



ON ^ OFF (49) 
ki 

In this scenario, cells randomly switch between the high and low expressing states at rates 
and k2 (see 1 1 1 for parameter values corresponding to slow and fast-switching cells). The growth 
rate (Eq. (j2])) of fit cells was set higher than the corresponding growth rate for unfit cells in the 
same environment. In order to avoid synchronization in the population level dynamics, we set 
Vdiv = 2Vo + where ^ is a small random number drawn from a normal distribution with zero 
mean and 0.2 variance. 

Figure [8] shows the growth rates obtained from simulations of slow and fast-switching cell 
populations, where cells were transfered from E2 to El, and vice versa, at f = 0. Growth rates 
show a transition period and a steady-state region. In agreement with experiments (see Acar et 
al. IH), fast-switching cells were found to recover from the effect of environment change faster 
than slow-switching cells but have a lower steady-state growth rate. 



Next we implemented the full model of gene expression described by Eqs. (43l-(48l. The 
fitness Wk of each cell k, which is here defined as a function of the environment and cellular 
protein concentration [P], was described by a Hill function 

This equation describes partitioning of cells into fit (wii{E, P) > 0.5) and unfit (wa(£', P) < 0.5) 
phenotypes corresponding to whether or not their [P] in a particular environment is above or 
below a particular value given by the Hill coeflicient K. The volume of each cell was described 
by Eq. Q, except here tq = t^/w, where is the cell division time in absence of selection. To 
incorporate the effect of fitness on gene expression, the value of transcription rate parameter vq 
depended on whether or not a cell was fit in either £1 or E2 (see Fig.|9]for parameters). 

The population distributions obtained for this model are shown in Figure [9] Specifically, we 
first obtained the steady-state protein concentration distributions for cells in £1 and E2 (Fig. 
and|9]3 respectively). Here, the majority of cells either fell within a distribution centered at higher 
value characterizing the ON cells, or a distribution centered at a lower value of P characterizing 
the OFF cells, in £1 or E2 respectively. The rest of the cells fell within the distribution capturing 
the unfit subpopulation in both environments. These results were found experimentally in 0] and 
are expected, as higher levels of the uracil enzyme are either favorable or unfavorable with respect 
to the fitness of the cells depending on the environment. Next, the time-dependent population 
distributions after the transition to £1 from E2, and vice versa, were obtained (Fig. and|9}) 
respectively). Here, the dynamics of the two distinct subpopulations of cells in transition between 
the steady-states are visible. As time progresses after the environmental transition, less and less 
of the cells are in the unfit state (ON in Fig. and OFF in Fig. as the cells in the more fit 
state (OFF in Fig. and ON in Fig. |9|5) grow and divide at a faster rate and therefore come to 
dominate the population in terms of absolute numbers. 
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7. Conclusions 

We have presented a framework for the stochastic simulation of heterogeneous population 
dynamics. The accuracy of the method was verified by comparing simulation results of stochastic 
gene expression and population dynamics with corresponding steady-state and time-dependent 
analytical solutions and experimental results. Parallel execution of the algorithm was found to 
significantly decrease run-times in comparison to simulations run on a single processor, and did 
not introduce errors in numerical results. 

The algorithm was also shown to be capable of simulating and capturing the dynamics of 
a cell population in a fluctuating environment, where phenotypic variability strongly influences 
gene expression dynamics. Agreement between this framework and the experimental and theo- 
retical results obtained using a deterministic reaction-rate method in Acar et al. |[T], serves as a 
further benchmark for the proposed method. Furthermore, the algorithm's ability to capture the 
steady-state and time-independent phenotypic distributions in this system exemplifies the utility 
of this approach, as these distributions cannot be obtained using a deterministic framework. 

Current cellular population simulation methods, including the present algorithm, treat the 
extracellular environment as homogeneous (e.g. the spatial-temporal concentration profile of a 
nutrient required for growth is held constant). This prohibits, for example, the inclusion of 
competition for a limiting resource in the present implementation. However, it is possible to 
model feedback between cells and their environment. The simplest approach would be to assume 
that the environment is constant over short time intervals. The change in total population cell 
volume at the end of each interval could then be used to calculate how much nutrients have 
been consumed and the parameters describing the environment adjusted accordingly. Since the 
time intervals would have to be sufficiently short so that the change in concentration of the 
nutrient during any particular interval is negligible, the computational workload would increase 
substantially. The focus of future work will be on developing and benchmarking accurate and 
efficient augmentations that permit population simulators to handle these and other more complex 
scenarios. 
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Figure 1: Flow diagram of the present algorithm for the parallel stochastic simulation of gene expression and heteroge- 
neous population dynamics. 
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Figure 2: Simulation results and time-dependent analytical solutions of a two-stage model of gene expression (301. The 
distribution of protein numbers for a population of cells at two different dimensionless times, r = 0.2 and r = 10, is 
shown. 
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Figure 3 : Simulation results and time-dependent analytical solutions of a two-stage model of gene expression 1 30 1 . Mean 
protein ^ip (top) and noise Tfp (bottom) are plotted as a function of dimensionless time r. Red dots indicate simulation 
results and black curves analytical solutions 1301 . 
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Figure 4: Time series of a single cell within a growing and dividing population. Protein number (top) and concentration 
(middle), and mRNA number (bottom), were obtained and found to be in agreement with a model of translation provided 
in 1 33 1 . Gene duplication occurs every tj = OAT into the cell cycle and results in an increased rate of protein production 
until the next cell division event where the number of genes prior to duplication is restored. 
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Figure 5: Comparison of simulation results and analytic solutions. Mean mRNA values are plotted as a function of gene 
copy number n (top). The noise in mRNA number is also plotted as a function of n (bottom). Note that mean mRNA 
values increase and the noise decreases after gene duplication as expected. Black curves indicate analytical values 1331 
and red dots simulation results. 
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Figure 6: Comparison of simulation results and analytic solutions. Mean protein number (top) and noise (bottom) as a 
function of time t for two different values of the protein degradation parameter d\ . Note the increase in protein production 
rate and decrease in noise levels that occurs after gene duplication at f = 0.4. Red dots indicate simulation results and 
black curves analytical values L33i . 
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Figure 7: Simulation of a stochastic population dyanmics model |35 1 of a Saccharomyces cerevisiae population under- 
going stochastic (size at division) and asymmetric (partitioning of cell volume) division, (a) Steady-state distribution 
of cell sizes for a population of 100000 cells, (b) Steady-state size distiibution of a representative sample (8000 cells) 
obtained using the constant-number Monte Carlo method I18II31I of the 'true' population shown in (a), (c) Plot of the 
probabilities population shown in (b) against the probabilities of the population shown in (a) along with linear regression. 
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Figure 8: Simulations of populations of slow and fast-switching cells, (a) Growth rates of cells transfered from an 
environment containing uracil and 5-FOA (E2) to one containing no uracil (El) at ^ = 0. (b) Growth rates of cells 
transfered from El to E2 at ; = 0. Note that the transient before the steady-state region is shorter in (a) than in (b), and 
that fast-switching cells recover faster from the environment change but slow-switching cells have a higher steady-state 
growth, in agreement with experimental results found in [IJ. 
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Figure 9: Simulations of environmental effects on phenotypic distribution, (a) Steady-state (top and bottom figures) and 
time-dependent (middle figures) protein distributions of cells resulting from an environment change from El to E2. (b) 
Steady-state (top and bottom figures) and time-dependent (middle figures) protein distributions of cells resulting from 
an environment change from E2 to El. Note that when a sufficient amount of time has elapsed after the environmental 
transition from either El to E2 or vice versa, cells with either the OFF or ON phenotype proliferate, respectively, in 
agreement with experimental results found in 1 1 1 . The following parameters were used (units .v"'): do = 0.005, V| =0.1, 
d[ = 0.008, K = 200, n = 10. In El vq a = 0.2 for fit cells and i'o.r = 0.05 for unfit cells - vice versa in E2. Additionally 
was set to the mean doubling time (MDT) of 1.5 hours for Saccharomyces cerevisiae (3). 
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